1 Executive Summary
Modern museums need to add interactivity to their collections to stay competitive in the digital era. This report presents descriptive work exploring how a museum might use an interactive map and distance data to show visitors the geographic scale over which the specimens in its collection were gathered.
2 Full Report
2.1 Initial Data Analysis (IDA)
2.1.1 Sources
The data used in this research comes from three separate datasets.
The first is raw_data, sourced from The University
of Sydney’s Macleay Collections at the Chau Chak Wing Museum and
provided as an Excel file. The data comes from small paper labels that
are physically attached to each specimen. The two fields of main
concern are Fields E and F, which record the specimen’s species/subspecies
and the geographical place where the animal was collected, respectively.
The second is location_data, provided by the
tutors as a CSV file. This dataset gives specific latitude and longitude
data for the text-based location entries in the raw data. The six
columns of main interest are lon, lat, north, south,
east and west. The lon and lat columns represent the geographic
center of the area in which a specimen was collected. The north, south,
east and west columns represent the most northern, southern, eastern
and western points of the collection area. The north, south, east and
west values can be thought of as defining a square in which the specimen
was collected, with the lat and lon values defining the center of this
square.
The third is taxonomy_data, also provided by the
tutors as a CSV file. This dataset provides taxonomic data for the
text-based species entries in the raw data. It links the name
on a specimen’s label to its kingdom, phylum, class,
order, family, genus and species.
## Initialise Libraries
library(ggplot2)
library(plotly)
library(tidyverse)
library(leaflet)
library(leaflet.extras)
library(readxl)
library(dplyr)
library(readr)
library(geosphere)
library(knitr)
library(scales)
## Read in data
raw_data =
read_xlsx("rawData.xlsx")
location_data =
read.csv("locationData.csv")
taxonomy_data =
read.csv("taxonomyData.csv")
2.1.2 Structure
Below are the dimensions of each dataset.
dimensions <- rbind(dim(raw_data), dim(location_data), dim(taxonomy_data))
colnames(dimensions) <- c("Rows", "Columns")
rownames(dimensions) <- c("Raw_Data", "Location_Data", "Taxonomy_Data")
dimensions %>% kable()
| | Rows | Columns |
|---|---|---|
| Raw_Data | 34952 | 11 |
| Location_Data | 1786 | 10 |
| Taxonomy_Data | 27069 | 10 |
Here are the data types of the variables required to answer our research question.
# raw_data variables of interest
raw_variables <- select(raw_data, c(5:6))
classes <- as.data.frame(lapply(raw_variables, class))
raw <- rownames_to_column(as.data.frame(t(classes)), "raw_data Variables")
colnames(raw)[2] <- "Type"
raw %>% kable()
| raw_data Variables | Type |
|---|---|
| Name.in.label | character |
| Locality | character |
# location_data variables of interest
location_variables <- select(location_data, c(1:2, 6:10))
classes <- as.data.frame(lapply(location_variables, class))
loc <- rownames_to_column(as.data.frame(t(classes)), "location_data Variables")
colnames(loc)[2] <- "Type"
loc %>% kable()
| location_data Variables | Type |
|---|---|
| lon | numeric |
| lat | numeric |
| north | numeric |
| south | numeric |
| east | numeric |
| west | numeric |
| uniqueLocation | character |
# taxonomy_data variables of interest
taxonomy_variables <- select(taxonomy_data, 1)
classes <- as.data.frame(lapply(taxonomy_variables, class))
tax <- rownames_to_column(as.data.frame(t(classes)), "taxonomy_data Variables")
colnames(tax)[2] <- "Type"
tax %>% kable()
| taxonomy_data Variables | Type |
|---|---|
| user_supplied_name | character |
2.1.3 Limitations and Assumptions
The primary limitation on our research is the precision and presence
of location data for each specimen in the raw_data dataset.
Some specimens have no locality information on their labels,
which means we cannot use them in our analysis. For other specimens,
the precise collection location is unknown, so only a broad area is
recorded. We have no way of knowing how accurate the distances are for
those specimens, and including them could lead to unjustifiable
conclusions, so we apply a benchmark location precision that specimens
must meet to be included in our analysis (see
section 2.2).
We also cannot assume that specimens of the same species in the collections came from the same location. This is reflected in the structure of the raw data, where different specimens of the same species do not necessarily have the same locality associated with them.
2.1.4 Data Cleaning
filtered_raw_data <- raw_data %>%
filter(is.na(Locality) == FALSE) %>%
filter(!grepl("\\[|\\]", Locality)) # [] in the locality column enclose text saying that the location is unknown
cat(sprintf("%s of the specimens in the raw data have a location assigned to it",
label_percent()(nrow(filtered_raw_data)/nrow(raw_data))))
49% of the specimens in the raw data have a location assigned to it
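The filtering rule can be illustrated on a toy vector. The sketch below uses base R only, and the locality strings are hypothetical rather than taken from the collection; it keeps an entry only if it is present and contains no square brackets.

```r
# Hypothetical locality labels; NA and bracketed text mark unusable entries
localities <- c("Sydney, NSW", NA, "[locality unknown]", "Perth, WA")

# Keep a label only if it is present and contains neither "[" nor "]"
has_location <- !is.na(localities) & !grepl("\\[|\\]", localities)

usable <- localities[has_location]
share_usable <- mean(has_location)  # proportion of labels that survive
```

Here two of the four labels survive, so share_usable is 0.5; the same proportion computed on raw_data gives the 49% reported above.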
We needed to derive new information from the data to better address our research question.
First, new columns were added to the location data giving
the distance between a specimen’s collection location and
the Chau Chak Wing Museum. The minimum and maximum distances, based on the
corners of the collection location square, were also calculated and added
to the location data. These can be used to calculate the precision of the
location and hence to filter our data to specimens with precise locations.
This was done by defining the functions find_distance,
find_max_distance and find_min_distance.
After this, a function named long_for_world was defined
to transform the longitude coordinates of the data so that the Museum
is in the center of the map. This is important for visualising
the datasets on a map.
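As a rough illustration of what the distance helpers compute, the sketch below re-implements a great-circle distance in base R using the spherical law of cosines, the formula that geosphere's distCosine is documented to use. The function name great_circle_m, the Earth radius (assumed to match that function's default) and the example bounding box are all assumptions made for illustration, not values from location_data.

```r
# Great-circle distance in metres using the spherical law of cosines.
# The radius is an assumed value (the WGS84 equatorial radius).
great_circle_m <- function(lon1, lat1, lon2, lat2, radius = 6378137) {
  to_rad <- pi / 180
  cos_angle <- sin(lat1 * to_rad) * sin(lat2 * to_rad) +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * cos((lon2 - lon1) * to_rad)
  # Clamp to [-1, 1] so rounding error cannot push acos() out of its domain
  acos(pmin(1, pmax(-1, cos_angle))) * radius
}

museum_lon <- 151.189392
museum_lat <- -33.885872

# Hypothetical collection box around Perth (not taken from location_data)
box <- list(north = -31.5, south = -32.5, east = 116.5, west = 115.5)

# Distance from the museum to each of the four corners of the box
corners <- expand.grid(lon = c(box$east, box$west),
                       lat = c(box$north, box$south))
corner_dist <- great_circle_m(corners$lon, corners$lat, museum_lon, museum_lat)

min_distance <- min(corner_dist)
max_distance <- max(corner_dist)
```

Taking the minimum and maximum over the four corners mirrors the approach described above; it is an approximation, since the nearest point of a box is not always a corner.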
museum_lat <- -33.885872
museum_long <- 151.189392151
## Create empty vectors for new data columns
distance_vec <- c()
max_distance_vec <- c()
min_distance_vec <- c()
long_for_world_vec <- c()
## Creates function that finds the distance (in metres) from a point to the Museum.
## Columns are looked up by name with [[ ]] and the row index is passed in explicitly,
## so the result does not depend on a global loop variable or on a column that
## happens to share its name with an argument.
find_distance <- function(long_col, lat_col, val) {
  distCosine(
    matrix(c(location_data[[long_col]][val],
             location_data[[lat_col]][val]),
           nrow = 1, ncol = 2),
    matrix(c(museum_long, museum_lat),
           nrow = 1, ncol = 2))
}
## Creates function that finds the distances from the Museum to the four corners of the collection site square
corner_distances <- function(val) {
  c(find_distance("east", "north", val),
    find_distance("east", "south", val),
    find_distance("west", "north", val),
    find_distance("west", "south", val))
}
## Creates function that finds the maximum distance from a point to the Museum as defined by the corners of the collection site square
find_max_distance <- function(val) {
  max(corner_distances(val))
}
## Creates function that finds the minimum distance from a point to the Museum as defined by the corners of the collection site square
find_min_distance <- function(val) {
  min(corner_distances(val))
}
## Creates function that centers the world map's longitude to the Museum
long_for_world <- function(val) {
  lon <- location_data$Longitude[val]
  if (!is.na(lon) && lon < (museum_long - 180)) {
    lon + 360
  } else {
    lon
  }
}
## Mutates the location data to make the Latitude and Longitude numeric
location_data <- location_data %>%
  mutate(Latitude = as.numeric(lat),
         Longitude = as.numeric(lon))
## Iterating loop that calculates the distances and new longitude for each row in the location_data data frame
for (val in seq_len(nrow(location_data))) {
  distance_vec <-
    append(distance_vec, find_distance("Longitude", "Latitude", val))
  max_distance_vec <-
    append(max_distance_vec, find_max_distance(val))
  min_distance_vec <-
    append(min_distance_vec, find_min_distance(val))
  long_for_world_vec <-
    append(long_for_world_vec, long_for_world(val))
}
## Adds the 4 new columns to the location_data data frame
location_data <- location_data %>%
mutate(distance = distance_vec) %>%
mutate(min_distance = min_distance_vec) %>%
mutate(max_distance = max_distance_vec) %>%
mutate(Longitude = long_for_world_vec)
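The longitude re-centering can also be written as a single vectorised step. The following is a minimal sketch under that assumption; recentre_longitude is a hypothetical helper name and the example longitudes are made up.

```r
# Shift longitudes that lie more than 180 degrees west of the map center
# by a full turn, so every point falls within 180 degrees of the center.
recentre_longitude <- function(lon, centre) {
  ifelse(!is.na(lon) & lon < centre - 180, lon + 360, lon)
}

museum_long <- 151.189392151
example_lon <- c(-100, 0, 151.2, NA)   # hypothetical longitudes
recentred <- recentre_longitude(example_lon, museum_long)
```

A site at 100°W is moved to 260°, which places the Americas to the east of Australia on the map, while longitudes already within 180° of the Museum and missing values are left unchanged.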
Finally, the three datasets were merged into one, named
merged_data.
## Merges the raw_data and location_data data frames into the merged_data data frame
merged_data <-
left_join(raw_data, location_data,
by =c("Locality"="uniqueLocation"))
## Merges the merged_data and taxonomy_data data frames
merged_data <-
left_join(merged_data, taxonomy_data,
by = c("Name in label"="user_supplied_name"))
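A left join keeps every row of the left-hand table, which is why merged_data has the same number of rows as raw_data in the summary below, provided each key appears only once in the lookup table. The sketch below shows the same behaviour with base R's merge on toy data; the data frames and their values are hypothetical stand-ins.

```r
# Toy stand-ins for raw_data and location_data (values are hypothetical)
specimens <- data.frame(
  Locality = c("Sydney", "Perth", "Atlantis"),
  specimen_id = 1:3
)
locations <- data.frame(
  uniqueLocation = c("Sydney", "Perth"),
  lat = c(-33.9, -31.95)
)

# Left join: every specimen is kept, unmatched ones get NA coordinates
joined <- merge(specimens, locations,
                by.x = "Locality", by.y = "uniqueLocation",
                all.x = TRUE)

n_unmatched <- sum(is.na(joined$lat))                        # specimens with no match
keys_unique <- !any(duplicated(locations$uniqueLocation))    # guards against row duplication
```

Checking for duplicated keys before joining is a cheap way to confirm that the join cannot inflate the number of specimens.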
A final summary of the dimensions of the data was also created.
merged_dim <- data.frame(Rows = nrow(merged_data),
Columns = ncol(merged_data),
row.names = "Merged_Data")
dimensions <- rbind(dimensions, merged_dim)
dimensions %>% kable()
| | Rows | Columns |
|---|---|---|
| Raw_Data | 34952 | 11 |
| Location_Data | 1786 | 10 |
| Taxonomy_Data | 27069 | 10 |
| Merged_Data | 34952 | 34 |
2.2 Research Questions and Analysis
2.2.1 Research Question
What is the locational distribution of specimens in the Macleay Collection?
2.2.1.1 Research Subquestion 1
What is the distribution of distances between a collection site and the Chau Chak Wing Museum for the specimens in the Macleay Collection?
2.2.1.2 Research Subquestion 2
Is there an effective way to visualise the locational information that is found?
2.2.2 Analysis
Two histograms were created to analyse Research Sub-question 1.
The first shows the distance between a specimen’s collection site and the Chau Chak Wing Museum. It is restricted to specimens found in Australia whose possible distances span less than 10% of the central distance (a precision of roughly ±5%). The second histogram has an identical format, but includes data from all around the world.
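The precision benchmark can be sketched as follows: the spread between the largest and smallest possible distance is divided by the central distance, and a site is kept if that ratio is below 0.1. The three sites and their distances are hypothetical.

```r
# Hypothetical distances in metres for three collection sites
sites <- data.frame(
  distance     = c(1000e3, 1000e3, 3300e3),
  min_distance = c( 960e3,  800e3, 3250e3),
  max_distance = c(1040e3, 1300e3, 3400e3)
)

# Relative spread of the possible distances around the central distance
sites$relative_spread <- (sites$max_distance - sites$min_distance) / sites$distance

# Keep sites whose spread is under 10% (about +/-5% around the center)
sites$precise <- sites$relative_spread < 0.1
```

The first and third sites pass the benchmark, while the second, whose possible distances range over half of its central distance, would be excluded.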
## Creates a data frame that has all specimens found in Australia that are known to a 10% collection distance precision
australia_precise_data <- merged_data %>%
filter(distance > 0) %>%
filter((north < museum_lat) |
(south > museum_lat) |
(east < museum_long) |
(west > museum_long)) %>%
filter(((max_distance - min_distance)/distance) < .1) %>%
filter(Latitude > -50) %>%
filter(Latitude < -10) %>%
filter(Longitude > 100) %>%
filter(Longitude < 155)
## Prints a numerical summary for the distances of the australia_precise_data data frame
summary_australia_df =
summary(pull(australia_precise_data, "distance")) %>%
as.matrix() %>%
t() %>%
as.data.frame()
summary_australia_df_km <-
summary_australia_df[1,]/1000
rownames(summary_australia_df) <- NULL
summary_australia_df_km %>%
kable(caption = "Numerical Distance Summary for Australian Data (in km)",
digits = 0,
format.args = list(big.mark = ","),
col.names = c('Minimum (km)',
'1st Quartile (km)',
'Median (km)',
'Mean (km)',
'3rd Quartile (km)',
'Maximum (km)'))
| Minimum (km) | 1st Quartile (km) | Median (km) | Mean (km) | 3rd Quartile (km) | Maximum (km) |
|---|---|---|---|---|---|
| 2 | 320 | 969 | 1,355 | 2,141 | 3,809 |
## Makes a distance histogram of the australia_precise_data data frame
australia_precise_data %>%
ggplot(aes(x=distance/1000)) +
geom_histogram(bins = 20,
fill = "steelblue1",
color = "black") +
labs(x="Distance of Specimen Collection Site from Chau Chak Wing Museum (km)",
y="Frequency",
title="Distance Distribution of Collection Sites in Australia from Chau Chak Wing Museum") +
theme(legend.position="none",
plot.background = element_rect(fill = "#f7f7f7",
size = 0),
axis.title = element_text(face="bold"),
plot.title = element_text(face="bold",
size = 13,
hjust = 0.5))
As can be seen in the above histogram, a large cluster of specimens was collected within about 1,500 km of the Chau Chak Wing Museum, while a smaller cluster was collected between approximately 2,700 km and 4,000 km from the museum. Separating these two clusters, a large spike can be seen at a distance of 2,000 km. A small cluster centered at about 1,700 km from the museum can also be seen. By comparing the sizes of these clusters, we can see that the data aggregates on the left-hand side of the graph. This is supported by the numerical summary shown above the histogram, which is characteristic of right-skewed data: the median is considerably closer to the first quartile than to the third. The numerical summary additionally shows that the collection site closest to the museum is at a distance of 2 km, while the Australian collection site furthest from the museum is at a distance of 3,809 km. While the geographic context of these distances cannot be definitively determined from the histogram and numerical summary, these elements directly address Research Sub-question 1 for Australian specimens. Given the right skew of the data, it can be concluded that most Australian specimens were collected in areas of Australia relatively close to the museum.
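The comparison of the median with the quartiles can be expressed as a single number. The sketch below uses a quartile-based (Bowley) skewness coefficient, which is one common choice and not necessarily the measure the analysis relied on; quartile_skew is a hypothetical helper name, and the inputs are the values from the Australian summary table.

```r
# Quartile-based (Bowley) skewness: positive when the median sits closer
# to the first quartile, negative when it sits closer to the third.
quartile_skew <- function(q1, med, q3) {
  (q3 + q1 - 2 * med) / (q3 - q1)
}

# Values in km from the Australian summary table
aus_skew <- quartile_skew(q1 = 320, med = 969, q3 = 2141)
```

The coefficient is positive for the Australian data (roughly 0.29), which agrees with the right skew read off the histogram.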
## Creates a data frame that has all specimens found across the world that are known to a 10% collection distance precision
world_precise_data <- merged_data %>%
filter(distance > 0) %>%
filter((north < museum_lat) |
(south > museum_lat) |
(east < museum_long) |
(west > museum_long)) %>%
filter(((max_distance - min_distance)/distance) < .1)
## Prints a numerical summary for the distances of the world_precise_data data frame
summary_world_df =
summary(pull(world_precise_data, "distance")) %>%
as.matrix() %>%
t() %>%
as.data.frame()
summary_world_df_km <-
summary_world_df[1,]/1000
rownames(summary_world_df) <- NULL
summary_world_df_km %>%
kable(caption = "Numerical Distance Summary for World Data (in km)",
digits = 0,
format.args = list(big.mark = ","),
col.names = c('Minimum (km)',
'1st Quartile (km)',
'Median (km)',
'Mean (km)',
'3rd Quartile (km)',
'Maximum (km)'))
| Minimum (km) | 1st Quartile (km) | Median (km) | Mean (km) | 3rd Quartile (km) | Maximum (km) |
|---|---|---|---|---|---|
| 2 | 1,965 | 12,190 | 9,524 | 15,674 | 19,524 |
## Makes a distance histogram of the world_precise_data data frame
world_precise_data %>%
ggplot(aes(x=distance/1000)) +
geom_histogram(bins = 20,
fill = "steelblue1",
color = "black") +
labs(x="Distance of Specimen Collection Site from Chau Chak Wing Museum (km)",
y="Frequency",
title="Distance Distribution of Collection Sites from Chau Chak Wing Museum") +
theme(legend.position="none",
plot.background = element_rect(fill = "#f7f7f7",
size = 0),
axis.title = element_text(face="bold"),
plot.title = element_text(face="bold",
size = 13,
hjust = 0.5))
The above histogram has the same format as the Australian histogram, but data from all around the world is now included. Three main clusters of specimens can be seen: one within 5,000 km of the museum, a smaller cluster spanning a distance of about 10,000 km to 14,000 km from the museum, and a larger cluster spanning a distance of about 14,000 km to 18,000 km from the museum. Given that Australian data is included in this histogram, and the maximum distance within Australia was almost 4,000 km from the museum, it is likely that the first of the clusters is largely composed of Australian specimens. It should also be noted that this graph appears to be left-skewed. This is supported by the numerical summary printed above the histogram, where the median is closer to the third quartile than to the first. The numerical summary additionally shows the minimum distance to be 2 km, the same as in the Australian data because this specimen was collected in Australia. The maximum, however, is now 19,524 km, which is close to the greatest possible distance between two points on the Earth’s surface (about 20,000 km, half the Earth’s circumference). Given this value is not an outlier*, we can conclude Macleay’s collection is well distributed along the full circumference of the Earth. Ultimately, this histogram and numerical summary have directly addressed Research Sub-question 1.
*The distance of 19,524 km was ruled out as an outlier because it
falls below the upper bound for non-outlier values, defined as:
Q3 + 1.5*IQR = 15,674 + 1.5*13,709 = 36,237.5 km
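The outlier check in the footnote can be reproduced directly from the world summary table. This is a small sketch in base R using the rounded quartiles reported above, so the result is only as exact as those rounded values.

```r
# Quartiles and maximum in km from the world summary table
q1 <- 1965
q3 <- 15674
max_distance_km <- 19524

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # largest value not flagged as an outlier

is_outlier <- max_distance_km > upper_fence
```

The upper fence exceeds any distance that is physically possible on the Earth’s surface, so under this rule no collection site could be flagged as a high outlier.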
Finally, a heat map of where the specimens were found was generated. This was also restricted to distance data known to a precision of 10%. Use the layer control to switch between the Australian and world data and to display circles that visualise the numerical summary tables. Clicking on a circle shows its distance from the Museum.
## Defines a marker that has a museum icon
museumIcon <- makeIcon(
iconUrl = "https://cdn-icons-png.flaticon.com/512/2385/2385397.png",
iconWidth = 25, iconHeight = 25,
iconAnchorX = 12.5, iconAnchorY = 12.5,
shadowUrl = "https://upload.wikimedia.org/wikipedia/commons/thumb/0/00/Antonia_Sautter_Creations.png/120px-Antonia_Sautter_Creations.png",
shadowWidth = 50, shadowHeight = 64,
shadowAnchorX = 4, shadowAnchorY = 62
)
## Creates the main heat map
map <- leaflet() %>%
addProviderTiles("CartoDB") %>%
## Adds the colouring for the density of specimens
addHeatmap(data = australia_precise_data,
lng = ~Longitude,
lat = ~Latitude,
radius = 6,
gradient = "GnBu",
#gradient = wes_new,
cellSize = 6,
blur = 10,
group = "Australia") %>%
addHeatmap(data = world_precise_data,
lng = ~Longitude,
lat = ~Latitude,
radius = 6,
gradient = "GnBu",
cellSize = 6,
blur = 10,
group = "World") %>%
## Creates a marker so that the location of the museum is known
addMarkers(lng = museum_long,
lat = museum_lat,
icon = museumIcon) %>%
## Creates circles to get a measure of distances on the map
addCircles(lng = museum_long,
lat = museum_lat,
weight = 4,
fill = FALSE,
dashArray = c("20, 20"),
color = "black",
radius = 1000000,
group = "Australia Distance Circles",
popup = "1000 km") %>%
addCircles(lng = museum_long,
lat = museum_lat,
weight = 4,
fill = FALSE,
dashArray = c("20, 20"),
color = "black",
radius = 2000000,
group = "Australia Distance Circles",
popup = "2000 km") %>%
addCircles(lng = museum_long,
lat = museum_lat,
weight = 4,
fill = FALSE,
dashArray = c("20, 20"),
color = "black",
radius = 3000000,
group = "Australia Distance Circles",
popup = "3000 km") %>%
addCircles(lng = museum_long,
lat = museum_lat,
weight = 2, fill = FALSE,
dashArray = c("10, 20"),
color = "black",
radius = 500000,
group = "Australia Distance Circles",
popup = "500 km") %>%
addCircles(lng = museum_long,
lat = museum_lat,
weight = 2, fill = FALSE,
dashArray = c("10, 20"),
color = "black",
radius = 1500000,
group = "Australia Distance Circles",
popup = "1500 km") %>%
addCircles(lng = museum_long,
lat = museum_lat,
weight = 2, fill = FALSE,
dashArray = c("10, 20"),
color = "black",
radius = 2500000,
group = "Australia Distance Circles",
popup = "2500 km") %>%
addCircles(lng = museum_long,
lat = museum_lat,
weight = 2, fill = FALSE,
dashArray = c("10, 20"),
color = "black",
radius = 3500000,
group = "Australia Distance Circles",
popup = "3500 km")%>%
## Creates circle that correspond to the above numerical summary of the Australian and World data sets
addCircles(lng = museum_long,
lat = museum_lat,
weight = 4,
fill = FALSE,
dashArray = c("20, 20"),
color = "#2b9e19",
radius = summary_australia_df[1, "1st Qu."],
group = "Australia Numeric Summary Circles",
popup = "1st Quartile - 320 km") %>%
addCircles(lng = museum_long,
lat = museum_lat,
weight = 4,
fill = FALSE,
dashArray = c("20, 20"),
color = "#b3ad19",
radius = summary_australia_df[1, "Median"],
group = "Australia Numeric Summary Circles",
popup = "Median - 969 km") %>%
addCircles(lng = museum_long,
lat = museum_lat,
weight = 4,
fill = FALSE,
dashArray = c("20, 20"),
color = "#eb5210",
radius = summary_australia_df[1, "Mean"],
group = "Australia Numeric Summary Circles",
popup = "Mean - 1,355 km") %>%
addCircles(lng = museum_long,
lat = museum_lat,
weight = 4,
fill = FALSE,
dashArray = c("20, 20"),
color = "#b51818",
radius = summary_australia_df[1, "3rd Qu."],
group = "Australia Numeric Summary Circles",
popup = "3rd Quartile - 2,141 km") %>%
addCircles(lng = museum_long,
lat = museum_lat,
weight = 4,
fill = FALSE,
dashArray = c("20, 20"),
color = "#2b9e19",
radius = summary_world_df[1, "1st Qu."],
group = "World Numeric Summary Circles",
popup = "1st Quartile - 1,965 km") %>%
addCircles(lng = museum_long,
lat = museum_lat,
weight = 4,
fill = FALSE,
dashArray = c("20, 20"),
color = "#eb5210",
radius = summary_world_df[1, "Median"],
group = "World Numeric Summary Circles",
popup = "Median - 12,190 km") %>%
addCircles(lng = museum_long,
lat = museum_lat,
weight = 4,
fill = FALSE,
dashArray = c("20, 20"),
color = "#b3ad19",
radius = summary_world_df[1, "Mean"],
group = "World Numeric Summary Circles",
popup = "Mean - 9,524 km") %>%
addCircles(lng = museum_long,
lat = museum_lat,
weight = 4,
fill = FALSE,
dashArray = c("20, 20"),
color = "#b51818",
radius = summary_world_df[1, "3rd Qu."],
group = "World Numeric Summary Circles",
popup = "3rd Quartile - 15,674 km") %>%
## Adds legends for the numerical summary circles
addLegend("bottomleft",
group = "Australia Numeric Summary Circles",
pal = colorFactor(c("#2b9e19",
"#b3ad19",
"#eb5210",
"#b51818"),
levels = c("1st Quartile",
"Median",
"Mean",
"3rd Quartile")),
values = c("1st Quartile",
"Median",
"Mean",
"3rd Quartile"),
opacity = 1,
title = "Numeric Summary Circles") %>%
addLegend("bottomleft",
group = "World Numeric Summary Circles",
pal = colorFactor(c("#2b9e19",
"#b3ad19",
"#eb5210",
"#b51818"),
levels = c("1st Quartile",
"Mean",
"Median",
"3rd Quartile")),
values = c("1st Quartile",
"Mean",
"Median",
"3rd Quartile"),
opacity = 1,
title = "Numeric Summary Circles") %>%
## Generates layer control to be able to select which elements are drawn on the map
addLayersControl(
baseGroups = c("Australia", "World"),
overlayGroups = c("Australia Distance Circles",
"Australia Numeric Summary Circles",
"World Numeric Summary Circles"),
options = layersControlOptions(collapsed = TRUE,
sortLayers = TRUE)) %>%
hideGroup(c("Australia Distance Circles",
"Australia Numeric Summary Circles",
"World Numeric Summary Circles")) %>%
## Sets the view to Australia
setView(lng = 134.189392151,
lat = -27.885872,
zoom = 4)
## Prints the map
map
The interactive heat map above directly addresses Research Sub-question 2, as the geographic distribution of the Macleay collection is presented clearly in an interactive manner. Additionally, the map includes useful distance information which can be accessed by clicking the control panel in the top-right corner. Ultimately, the map accessibly presents geographic and distance information of the collection, thereby making it an effective visualisation of the location data.
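The seven reference circles on the map follow a regular pattern, so their parameters could be generated rather than typed out one by one. The sketch below only builds the parameter table in base R; the column names are hypothetical, and looping over the rows to call addCircles once per row is left as an assumed next step.

```r
# One row per reference circle: 500 km steps out to 3,500 km
circle_spec <- data.frame(radius_km = seq(500, 3500, by = 500))

# Whole thousands are drawn heavier, matching the map's two line styles
is_major <- circle_spec$radius_km %% 1000 == 0
circle_spec$radius_m   <- circle_spec$radius_km * 1000   # map radii are in metres
circle_spec$weight     <- ifelse(is_major, 4, 2)
circle_spec$dash_array <- ifelse(is_major, "20, 20", "10, 20")
circle_spec$popup      <- paste(circle_spec$radius_km, "km")
```

Keeping the styling rules in one table makes it easier to change the spacing or add further circles without editing each call.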
The map adds geographic context to the distance results provided by the histograms. When looking at Australian data, the map shows that the cluster within 1,500 km of the museum comes from specimens which were collected in New South Wales, southern Queensland, south-east South Australia, Victoria and Tasmania. The majority of Australian specimens were found in these areas, particularly in New South Wales. It can also be seen that the cluster observed over 2,700 km from the museum comes from specimens collected in Western Australia, particularly in Perth, while the spike at a distance of 2,000 km from the museum comes from specimens collected in Cairns. The small cluster centered at 1,700 km from the museum can be seen to stem from specimens collected mainly in Townsville, Queensland.
Similarly, geographic context can also be provided for the global distance histogram using the map. The map shows the cluster within 5,000 km of the museum is largely located in Australia along with neighboring islands (as was expected from the histogram data). It can also be seen that the small and large clusters observed between 10,000 km and 18,000 km from the museum belong to specimens collected in North America and Europe respectively. The world histogram additionally shows a region of low specimen frequency between about 5,000 km and 10,000 km from the museum. The heat map shows that this is because this region consists largely of ocean. As outlined by Leonardi and colleagues (2021), only a few insect species inhabit the ocean, despite insects being the most successful group of organisms in the history of the Earth.
Interestingly, the majority of agents from whom John Macleay obtained specimens were located in Europe, North America and along the east coast of Australia (see Figures 1 and 2 in the Appendix) (Ville et al., 2020). This provides a potential explanation for the distributions pictured in the above heat map, as the insect specimens concentrate around these three main areas.
By integrating the distance data collected in the histograms with the geographic data provided by the heat map, the geographic distribution of collection sites for specimens in the Macleay Collection is comprehensibly summarised. Our overarching research question has hence been addressed.
2.4 References
Leonardi, M. S., Crespo, J. E., Soto, F., & Lazzari, C. R. (2021). How Did Seal Lice Turn into the Only Truly Marine Insects? Insects (Basel, Switzerland), 13(1), 46. https://doi.org/10.3390/insects13010046
Ville, S., Wright, C., & Philp, J. (2020). Correction to: Macleay’s Choice: Transacting the Natural History Trade in the Nineteenth Century. Journal of the History of Biology, 53(3), 377–378. https://doi.org/10.1007/s10739-020-09613-6